ABSTRACT

A needs assessment of the National Assessment of 
Educational Progress (NAEP) is presented. It deals with cost, design 
and technical issues, and utility. Suggestions include cost reduction 
via assessment schedule cutbacks and re-use of released NAEP 
exercises; and a shift from federal to private funding by selling 
NAEP exercises with interpretative materials to schools or 
individuals. A major unresolved question concerns the validity of 
inferences which can be drawn from the aggregated results. NAEP 
validation procedures constitute content validation. However, content 
validation does not necessarily constitute validity evidence at all, 
if validity evidence must bear on the interpretations that are 
warranted on the basis of test or assessment results. This means that
more work needs to be done on the construct validation of the NAEP 
results as they are now commonly interpreted, i.e., with results 
aggregated across exercises. NAEP needs to become more clear in 
reporting aggregate results by specifying the facets of variance over 
which it is, and is not, attempting to generalize. Two strategies are
suggested for improving the utility of NAEP results: developing norms for NAEP
exercises, and making NAEP exercises and data more readily available
to independent investigators. (PN)

WHAT COULD BE DONE DIFFERENTLY WITH NAEP?

In this brief memo, I set out a range of ideas for what could be done
differently with and for NAEP. Ideas are organized roughly into three 
categories, pertaining to cost, design and technical issues, and utility. 
I do not discuss administrative issues, mainly because I don't know much 
about current administration of NAEP - though I do realize from reading 
Hazlett that administrative and bureaucratic issues have had a sharp impact 
on cost, design and utility of NAEP. 
Cost 

NAEP is without a doubt quite expensive, and it is an expense borne
almost exclusively by federal government coffers. There are numerous 
reasons for concern over this situation, but the main one is the recent 
general cutback in federal funds for education and research. 

Wirtz and Lapointe discussed a number of different possible cost-saving
measures -- including very briefly the possibility of discontinuing NAEP. 
I won't attempt to review their suggestions here in any detail. 

It is clear that there are basically two different strategies for remedying 
the problem of much public money going into NAEP. The first is simply to
reduce the cost of NAEP, and the second is to find ways to have NAEP pay
its own way with other than public funds.* Let me discuss examples with
regard to each strategy.



*Wirtz and Lapointe's suggestions about getting states to pay more of NAEP's
costs seem a bit unrealistic to me. Most state governments are under as
much fiscal pressure as the federal government just now. Also, there appears
to be a trend in the past 5 years or so for states to move away from
assessments like NAEP, toward census-type testing as in minimum competency
testing.



Simply reducing costs could be accomplished in a variety of ways. Indeed,
it already has been, as NAEP has cut back on the assessment schedule and
in recent years has largely excluded the young adult sample. However, one
fairly simple way of saving money (which, for reasons that escape me, has
apparently not yet been tried) would be to reuse released NAEP exercises.
I have not seen detailed cost breakdowns for exercise development, but simply
from descriptions of the process, I assume that exercise development is a
fairly large expense. It is, of course, widely assumed that exercises or
items disclosed in the public domain are no longer valid for operational
use. Whatever the merits of this premise in general (there was considerable
debate over it, for example, in hearings on "truth-in-testing" legislation),
it seems to me to have little merit with respect to NAEP. The major argument
typically advanced against operational use of released items is that prior
familiarity with an item may invalidate it as a measure of the real skills
or knowledge of interest. Subjects may have simply memorized the intended
answers.

However, there are four reasons why I think this argument does not 
pertain to NAEP:

- First, there is such a large number of NAEP exercises that it seems
very likely that very few people, if anyone, could familiarize
themselves with (much less memorize) all or even most of the released
exercises.

- Second, individuals who will be assessed lack incentive to do this.
Since no direct consequences at all flow from NAEP assessments for the
individuals assessed, there is no reason a priori for them to
familiarize themselves with or to memorize NAEP exercises.



- Third, lacking incentive as a potential cause of this occurring, there
is very little chance of prior familiarity leading to invalidation
simply by chance. Suppose that a released NAEP item is published
once in every newspaper in the country (very unlikely, I suspect, from
what I know of newspaper coverage of NAEP). Now it has been reported
that only around 20% of U.S. adults read newspapers every day. Among
17-year-olds the figure is presumably lower, say 10%. If my reading
habits are any guide (reading about half of what is in any one day's
paper), then it would be reasonable to estimate that only half of these
(i.e. 5%) would read the NAEP item. Further, it seems plausible that
even if someone saw the NAEP item, it would be unlikely (say
a 1 in 5 chance) that the person would remember the NAEP item later,
after a substantial period, when they happened to be in a NAEP sample.
All this would mean that there would only be around a 1% chance of
the sort of invalidation feared occurring with an item published in
every newspaper in the country. Odds would presumably be
much smaller for 9- and 13-year-olds, who would be less likely to read
newspapers.

All this suggests that the magnitude of error deriving from use of
released items would be substantially less than the sampling error already
implicit in the NAEP design. Obviously it would be easy to come up
with numbers different from the ones advanced above, but I note that
several years ago when ETS did a study of the effect of readministering
the same forms of the SAT to the same individuals, after a period
of only 3-4 months, it was estimated that the effect was fairly
small -- well within the standard error for individuals, if I recall
correctly.

- Fourth, many of the NAEP exercises are not multiple-choice but instead
are open-ended or performance exercises. Invalidation resulting from
prior exposure obviously is less severe with such items.
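
The rough arithmetic in the third point above can be laid out as a minimal
Python sketch. The three rates are the guesses given in the text (10% of
17-year-olds reading a paper daily, half of those reading the item, a 1 in 5
chance of remembering it). The function names, the 50% baseline exercise, and
the worst-case assumption that everyone who remembers an item answers it
correctly are illustrative only and are not part of the memo.

```python
def chance_of_prior_familiarity(daily_readers, reads_item, remembers_item):
    """Multiply the three conditional rates from the text into one probability."""
    for rate in (daily_readers, reads_item, remembers_item):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must lie between 0 and 1")
    return daily_readers * reads_item * remembers_item


def inflated_proportion_correct(true_rate, familiar_share):
    """Observed proportion correct if every familiar respondent answers correctly.

    This is an illustrative worst case, not a claim made in the memo.
    """
    return true_rate + familiar_share * (1.0 - true_rate)


# Guesses taken from the text for 17-year-olds.
familiar = chance_of_prior_familiarity(0.10, 0.50, 0.20)

# Hypothetical exercise that 50% of respondents would otherwise answer correctly.
observed = inflated_proportion_correct(0.50, familiar)

print(f"chance of prior familiarity: {familiar:.1%}")
print(f"observed proportion correct: {observed:.1%}")
```

Under these guesses the chance of prior familiarity comes to about 1%, and even
in the worst case a 50% exercise would shift by roughly half a percentage point.
Different guesses can be substituted to see how sensitive the conclusion is.
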

I suggest that the re-use of released exercises should definitely be
investigated, first by looking at cost implications and then with some pilot
work on implications of reuse for exercise validity. (Though a severe problem
in the latter regard, as I discuss later, is that very little work has been
done on exercise validity, apart from content validity.)

The second strategy for addressing the cost problem would be not
necessarily to reduce costs of NAEP, but instead to shift them from federal
coffers to private ones. One idea for doing this would be to somehow make
NAEP exercises commercially available for independent use. I realize that
some NAEP exercises have already been used in state assessments, but this
is something which ECS has done, according to Hazlett, only at cost (a
strategy perhaps related to the fact that the states are the constituency
of ECS). What I have in mind, however, is to sell sets of NAEP exercises
(packaged in some usable form, together with interpretative materials) to
schools, individuals, etc. as a means of paying for NAEP development costs.
As I pointed out earlier, this would doubtless raise hackles among private
testing companies, but as I argued, there are a variety of leasing arrangements
which might overcome such problems. Moreover, there is clear precedent for
this sort of thing with the Adult Performance Level (APL) test, developed
at considerable federal expense and now marketed by ACT -- and the APL is much
worse than the NAEP exercises I have seen. Finally, quite apart from fiscal
considerations, I think there are important educational reasons for selling
NAEP items, or at least for making them far more widely available.

Design and Technical Issues

I have many observations and suggestions in this realm, but for the 
moment I will stick simply to points of major concern regarding validity
and reliability. 

As long as NAEP exercises were interpreted strictly one-by-one, without 
much attempt to aggregate results, I think that there were relatively few 
problems with regard to validity and reliability. However, now that results
are regularly aggregated across objectives, sub-objectives, subject areas, 
etc., it seems to me that a major unresolved question concerns the validity 
of inferences which can be drawn from aggregated results. 



NAEP validation procedures constitute what most people would call content
validation. As several prominent observers (such as Messick and Cronbach)
have argued in recent years, however, content validation does not necessarily
constitute validity evidence at all, if we mean by this evidence bearing
on the interpretations that are warranted on the basis of test or assessment
results. If we accept this line of argument, it means that far more work
needs to be done on the construct validation of NAEP results as they are now
commonly interpreted, i.e. with results aggregated across exercises.

This is not simply an academic issue. Though I have stated previously
that I thought this observation was very important, I think it even more
so now that I have had a chance to inspect many of the released NAEP exercises.
What I found was that the relationship between exercises and the objectives
and sub-objectives to which they were assigned seemed extremely tenuous in
many cases. Thus, as part of construct validation of NAEP exercise sets,
I suggest that additional content validation work needs to be done. This
suggestion, by the way, should not be taken as a criticism of NAEP only, for
these problems (that is, tenuous relationship between objectives and items,
and lack of construct validity evidence) seem to me to be very common among
so-called criterion-referenced or objective-referenced tests.
Construct validation is, of course, normally thought of as being carried out
with respect to tests or subtests, but any of the construct validation strategies
applied at the higher levels of aggregation also could be applied at the
item or exercise level.

The second general technical concern I have also relates to the interpretation
of aggregate NAEP results rather than interpretation of only exercise-level
results. My concern is that if NAEP results are to be interpreted above
the exercise level (e.g. in terms of sets of exercises such as those pertaining
to literal comprehension or inferential comprehension), then considerable
work could usefully be done on the generalizability of NAEP results. Here
I refer to generalizability theory as opposed to classical reliability theory.
Without getting into a long discussion of generalizability theory (which,
by the way, appears to be receiving considerable emphasis in the new joint
test standards), let me try to explain briefly the general nature of my concern.

NAEP has long been using jackknife estimation procedures for calculating
standard errors of measurement. In several ways this practice is eminently
praiseworthy. It appears to yield, for example, more appropriate estimates
(implicitly taking into account multiple stages of sampling) than would procedures
assuming simple random samples. By and large, I would have little quarrel
with the practice of applying the jackknife procedures at exercise level
(and avoiding further aggregation). The reason is that when results are
reported in the form of, say, 50% getting exercise 2 correct when administered
under XY conditions, and the exercise is presented along with the result,
there is little potential for misinterpretation. This manner of interpretation
makes it quite clear that results pertain to a particular exercise given
under particular conditions. In the language of generalizability theory,
the exercise and administrative conditions facets and variance are assumed
to be fixed -- and fixed very narrowly and specifically.

However, now that NAEP, "in addition to providing results on individual
items," also "reports the average performance across groups of similar items --
for the learning area as a whole, for a particular theme, objective or
sub-objective, and so on" (NAEP, Reading, Thinking, and Writing report, 1981,
p. xiii), it seems to me that the jackknife procedure as previously employed
is no longer adequate.

There are several ways of explaining my point, but for the sake of explication
let me briefly set out only one. As generalizability theory makes clear
(and classical reliability theory does not), many facets or sources of variance
can affect assessment results (e.g. tasks or exercises, administrative conditions,
samples tested, scoring procedures, etc.). The problem with NAEP's jackknife
procedure, as previously applied, is that it assumes (at least as I understand
it) that the only facet of error variance is the sample of individuals tested.
This is not an unreasonable assumption when results are interpreted at the
individual exercise level; but it seems to me potentially quite misleading
when results are reported in terms of sets of items labelled such as "literal
comprehension," "inferential comprehension" or "reference skills."
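
The point can be illustrated with a minimal Python sketch, offered under stated
assumptions rather than as a description of NAEP's actual procedure. NAEP's
jackknife operates on its multistage sample design; the sketch below uses a
simple delete-one jackknife, and the score matrix is invented. It compares the
standard error of an aggregate percent-correct figure when only persons are
treated as sampled with the standard error when the exercises themselves are
treated as a sample from a larger universe of exercises.

```python
from math import sqrt


def jackknife_se(values):
    """Delete-one jackknife standard error of the mean of `values`."""
    n = len(values)
    total = sum(values)
    # Mean recomputed with each observation left out in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    center = sum(leave_one_out) / n
    return sqrt((n - 1) / n * sum((m - center) ** 2 for m in leave_one_out))


# Hypothetical 0/1 scores: rows are persons, columns are exercises
# grouped under one label (for example "inferential comprehension").
scores = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]

# Proportion correct for each person (exercises held fixed).
person_means = [sum(row) / len(row) for row in scores]
# Proportion correct for each exercise (persons held fixed).
exercise_means = [sum(col) / len(col) for col in zip(*scores)]

se_persons = jackknife_se(person_means)      # persons as the only sampled facet
se_exercises = jackknife_se(exercise_means)  # exercises as a sampled facet

print(f"SE treating persons as sampled:   {se_persons:.3f}")
print(f"SE treating exercises as sampled: {se_exercises:.3f}")
```

With these invented scores the aggregate mean is the same either way, but the
uncertainty attributable to the choice of exercises is considerably larger than
that attributable to the choice of persons. A full generalizability study would
estimate the variance components jointly; this sketch only shows why the choice
of facet matters.
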

The reason is that other facets can and do contribute to variance in 
results in such areas: Exercise content and format, administrative conditions, 
and scoring all can contribute to error variance. Indeed, NAEP itself has 
pointed out that such facets can contribute substantially to variation in 
performance. For example, in the 1981 report Reading, Thinking, and Writing,
it was written in summary that:

    The nature of a particular passage has a strong, shaping influence on
    the characteristics of students' responses.

    Item formats also have a major influence on students' performance.
    (p. 3)

If this is so, then the problem of interpretation for NAEP is that facets
of variance concerning content samples, format samples, etc. are not fixed.*

This suggests to me that NAEP needs to become far more clear, in reporting
aggregate results, in specifying the facets of variance over which it is and
is not attempting to generalize. In the language of generalizability theory,
there is a need for becoming far more clear in specifying the universe of
observations and conditions over which generalizations are being drawn (intentions
with respect to generalizing across people are relatively clear).

Again, I strongly suspect that this is often more than a merely academic
issue. It certainly appears to me (though this is an issue that I have not
had time to track down thoroughly) that NAEP results may vary considerably
more in terms of samples of exercises aggregated under the same label than
in terms of cycle of assessment. In other words, it may be that the content
and format facets of variance are more important than the year or cohort
facet of variance when results are aggregated across sets of exercises.




*Strictly speaking, NAEP does cover itself on this point, maintaining that
aggregate results pertain only to "specific sets of exercises," but the
manner in which sets of exercises are labelled (e.g. "inferential comprehension,"
as opposed to exercise set X) belies this disclaimer.

Utility

Almost every reviewer of NAEP (e.g. Greenbaum, Hazlett, GAO, Wirtz &
Lapointe) has observed that NAEP results have not proven as useful as they
might be. It is true, I think, as Sebring & Boruch observed of Wirtz &
Lapointe, that some of the conclusions regarding NAEP's utility have been
rather cavalierly drawn. Nevertheless, given the substantial costs of NAEP,
it obviously is worth considering ways in which NAEP could be made more
useful. Here I would like to suggest two strategies: 1) developing norms
for NAEP exercises, and 2) making NAEP exercises and data more readily
accessible to independent investigators.

These suggestions are premised on the assumption that there are two
broad types of potential use of NAEP: one by educators and evaluators using
NAEP exercises and interpretative materials for their own purposes; and two
by researchers and other investigators independent of NAEP using NAEP data.
In essence, of course, I am suggesting that the best strategy for enhancing
the utility of NAEP may be a decentralized one -- that is, not more interpretation
and report peddling by NAEP itself, but instead more promotion of NAEP products
(i.e. exercises and performance data on the exercises).

On the first type of potential use, I have already suggested selling
sets of NAEP exercises for independent use as a means of making NAEP pay
more of its own way. The way NAEP is presently organized, however, I doubt
that such an effort would be terribly successful. Why? Because at present
there is no attractive way to make sense of the meaning of NAEP exercises.
Indeed, it is quite revealing, I think, that NAEP's own framework for organizing
exercises has changed over time -- from objectives to content by taxonomic
level matrix (in at least some cases).

The most obvious way in which to make sets of NAEP exercises more
attractive is to develop and publish norms for them. At first, this suggestion
may seem heretical to anyone familiar with the origins of NAEP. However,
on a theoretical level, there are two considerations which indicate that
developing norms for sets of NAEP exercises would not be as heretical as
it first might appear. First, despite some of the rhetoric in the early
days of NAEP against norm-referenced testing (e.g. Tyler arguing that selecting
items in terms of difficulty and discrimination can lead to important items
and objectives being overlooked), it is quite clear that NAEP has never
entirely done away with normative considerations in selecting exercises.
One-third were to be easy, one-third hard and one-third of middling difficulty.

Second, and perhaps even more important, it is vital to distinguish between
construction of test and assessment instruments and their interpretation.
As is being increasingly recognized nowadays, any test result, be it derived
from so-called criterion- or objectives-referenced tests or from a
"norm-referenced" test, can be interpreted in either criterion-referenced or
norm-referenced fashion. Thus, sets of NAEP exercises could be interpreted in
norm-referenced fashion (indeed they already are, as in deviation scores
of regional averages from the national mean) without undercutting the
distinctive character of NAEP's exercises as being developed and selected
in terms of specific objective or content by cognitive level specifications.

Availability of norms for sets of NAEP exercises would greatly enhance
their utility for educators and evaluators, I suspect. Moreover, developing
national norms for sets of NAEP exercises could be of substantial practical
interest. First, as Cooley and Lohnes have pointed out, because of its
sampling procedures, NAEP has the potential for developing norms which are
much more truly representative nationally than those of any of the commercial
test publishers. Second, NAEP norms might shed considerable light on the
debate over norm-referenced versus criterion-referenced testing. Several
small-scale pieces of research have clearly shown that selection of test
items in terms of prevailing standards of norm-referenced test construction can
bias the content coverage of a test. However, what has not been systematically
investigated is whether or not a test such as NAEP, constructed in terms
of objectives (with no screening applied in terms of item discrimination),
would yield substantially different norm-referenced results (e.g. in white
versus black comparisons or male versus female) than a test constructed in
norm-referenced fashion. Presumably normative differences could be smaller
or larger, but whichever the case the results might be of considerable practical
interest.

Beyond theoretical and practical issues, making NAEP exercise sets available
for local use with norms as aids for interpretation might, I think, be of
considerable educational interest. The reason is simply that NAEP has invested
a tremendous amount of time, energy and expertise in developing exercises
for educational goals which have been largely overlooked by commercial publishers
(mainly for economic reasons, I suspect). Making high-quality exercise sets
available for such areas as music and art, which too often are neglected
when it comes to assessment, could be of substantial educational value.

There are, of course, many different ways in which norms could be developed
for NAEP exercise sets. Some possibilities, such as grade equivalent norms,
obviously should be avoided. However, there are a number of other possibilities
(age and grade norms interpreted in terms of standard scores, percentiles,
growth curve norms etc.) which might be considered.
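
Two of the norm types just mentioned, percentiles and standard scores, can be
sketched in a few lines of Python. This is a minimal illustration under stated
assumptions: the norm sample is invented, the function names are hypothetical,
the standard score uses a mean-50, SD-10 scale chosen only for the example, and
the population standard deviation is used. Real norms would also have to take
NAEP's sampling weights into account, which this sketch ignores.

```python
from statistics import mean, pstdev


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting ties as half."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def standard_score(score, norm_scores, scale_mean=50.0, scale_sd=10.0):
    """Linear standard score relative to the norm group's mean and SD."""
    z = (score - mean(norm_scores)) / pstdev(norm_scores)
    return scale_mean + scale_sd * z


# Hypothetical norm sample: number correct on a 10-exercise set
# for one age group. These figures are invented for illustration.
norm_sample = [2, 4, 4, 4, 5, 5, 7, 9]

local_score = 7
rank = percentile_rank(local_score, norm_sample)
scaled = standard_score(local_score, norm_sample)

print(f"percentile rank: {rank:.2f}")
print(f"standard score:  {scaled:.1f}")
```

A school using an exercise set locally could place its own results against such
a table without any change to how the exercises themselves were constructed.
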

The second idea I would suggest for making NAEP more useful would be
to make exercises and performance data more accessible. There are many ways
in which this might be done (indeed, the idea already discussed, of selling
sets of NAEP exercises, is one strategy). However, as a first step in making
NAEP exercises and results accessible, what I would suggest is the development
of a comprehensive index to NAEP exercises, surveys, reports and data tapes.
In elaborating on this idea, let me first provide examples of why I think
NAEP data are not very accessible, and then describe ways in which an index
might make them more accessible.

First, the problem. Though I have long been interested in NAEP, only
recently have I begun to review detailed information on NAEP and NAEP exercises.
One thing I have done is to begin reviewing sets of released NAEP exercises.
As I started doing this, I was struck by two things. First, as already mentioned,
the connection between NAEP exercises and objectives seemed to me very tenuous
in many cases. Second, the NAEP classification of exercises seemed to camouflage
information on exercises which were of far more general interest than one
would suspect by looking merely at the objective under which they were classified.
This seemed particularly so in the case of open-ended exercises.

These considerations suggest to me that what would be very helpful would
be a comprehensive index to NAEP exercises, surveys, reports and data tapes.
Some of this information already exists, I realize, for example in the identification
numbers of NAEP exercises. However, it seems fairly clear to me that more
thorough indexing might make NAEP exercises more accessible. Exercises might
be classified, for example, not only in terms of objectives or subject areas,
but also in terms of vocabulary used in the exercise itself, in coding of
open-ended exercises, in terms of response format (e.g. multiple choice
or open ended, written or verbal), administrative conditions, etc. There are,
of course, many other dimensions in terms of which NAEP exercises, surveys,
reports and data tapes might be coded and thereby indexed. I cannot even
begin to mention most possibilities here. Hence, let me close simply by
reiterating my general point that classification of NAEP exercises in terms
of subject areas and objectives seems to me quite tenuous, and that classification
of NAEP materials and data more thoroughly, from several different perspectives,
might make NAEP more useful to people with diverse interests -- interests
which often may not coincide with NAEP's objectives or content-cognitive level
framework, but which might nevertheless be illuminated by the unique data
set which NAEP has accumulated.
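
As a rough illustration of the kind of multi-perspective index suggested above,
the following Python sketch builds a simple lookup table over several
dimensions at once. Everything in it is hypothetical: the exercise identifiers,
the field names, and the field values are invented for the example and do not
reflect NAEP's actual numbering or classification.

```python
from collections import defaultdict

# Invented exercise records; a real index would be built from NAEP's own files.
EXERCISES = [
    {"id": "EX-001", "subject": "reading",
     "objective": "inferential comprehension", "format": "open-ended",
     "vocabulary": ["newspaper", "editorial"]},
    {"id": "EX-002", "subject": "reading",
     "objective": "literal comprehension", "format": "multiple-choice",
     "vocabulary": ["newspaper", "schedule"]},
    {"id": "EX-003", "subject": "writing",
     "objective": "persuasive writing", "format": "open-ended",
     "vocabulary": ["editorial", "letter"]},
]


def build_index(records, dimensions):
    """Map each (dimension, value) pair to the set of exercise ids carrying it."""
    index = defaultdict(set)
    for record in records:
        for dimension in dimensions:
            values = record[dimension]
            if isinstance(values, str):
                values = [values]  # treat a single value like a one-item list
            for value in values:
                index[(dimension, value)].add(record["id"])
    return index


def lookup(index, **criteria):
    """Return the sorted ids of exercises that satisfy every criterion given."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return sorted(set.intersection(*matches)) if matches else []


index = build_index(EXERCISES, ["subject", "objective", "format", "vocabulary"])

# Exercises of interest to someone studying open-ended reading tasks.
print(lookup(index, subject="reading", format="open-ended"))
# Exercises using a given word, regardless of the objective they were filed under.
print(lookup(index, vocabulary="editorial"))
```

The second query shows the point of indexing from several perspectives: it
retrieves exercises across subject areas that the objective-based
classification alone would keep apart.
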



